Towards Syntactically Constrained Statistical Word Alignment
Abstract
In statistical machine translation, the fundamental problem of word alignment is the process of finding word-to-word connections (i.e. translations) across languages given a sentence in one language and its translation in another. In more formal terms, given a source-language sentence F of n words (f1, f2, ..., fn) and a target-language sentence E of m words (e1, e2, ..., em), an alignment is a mapping between subsets of F (elements of the power set 2^F) and subsets of E (elements of 2^E). Instead of a mapping between full subsets, an alignment is usually indicated as a collection of links, each of which connects some fj (1 ≤ j ≤ n) to some ei (1 ≤ i ≤ m). The total collection of links makes up the alignment for the given sentence pair.
In the general case, the total number of possible alignments, called the alignment space, is extremely large. With no restrictions in place and a sentence pair in which both sentences have n words, there are n^2 possible alignment links and 2^(n^2) possible alignments. If a one-to-one constraint is enforced, such that each word in F may align to only one word in E, this exponential space can be reduced to n!. Additional constraints may further restrict the alignment space or lead to related spaces (Cherry and Lin, 2006a). The natural goal of constrained alignment is to restrict the alignment space in such a way that “bad” or linguistically very unlikely alignments are ruled out while “good” or linguistically sound alignments remain possible or are preferred.
Word alignment is most commonly carried out within the scope of a parallel sentence represented as a flat stream of plain-text words or as a flat stream of sets of feature–value pairs. However, in the realm of natural language, it is also possible to represent the structure inherent in a sentence; further, the structure can provide useful information about which alignments are “good” and which are “bad” beyond what can be extracted from a flat string.
In this paper, we will consider a number of techniques for representing different levels of syntactic structure in the alignment process and examine the benefit of the information that this structure provides. First, in Section 2, we briefly describe basic statistical alignment models that do not take into account any overt representation of the syntax of the sentence they are aligning. A number of published extensions to or replacements for the base models will be discussed in Section 3; these approaches all explicitly model some level of structure on one or both sides of the parallel sentence pair. Section 4 considers tradeoffs that these models introduce, compares their expressive and restrictive powers, and concludes the paper with some possible avenues for future alignment research.
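The alignment-space sizes quoted in the abstract can be checked by brute force for very small sentences. The following is a minimal sketch, not code from the paper: the function names are hypothetical, an alignment is modeled as a set of (j, i) links, and the one-to-one case is assumed to mean that every source word links to exactly one distinct target word (a bijection), which is the reading under which the count is n!.

```python
from itertools import combinations, permutations, product
from math import factorial


def all_links(n, m):
    """Every possible link (j, i) between source position j and target position i."""
    return list(product(range(1, n + 1), range(1, m + 1)))


def count_unrestricted(n, m):
    """Count alignments by enumeration: any subset of the links is an alignment."""
    links = all_links(n, m)
    return sum(1 for k in range(len(links) + 1) for _ in combinations(links, k))


def count_one_to_one(n):
    """Count alignments where each source word links to exactly one distinct
    target word (assumption: a full bijection between the two sentences)."""
    return sum(1 for _ in permutations(range(1, n + 1)))


# Compare the enumerated counts with the closed forms for tiny n-word pairs.
for n in (1, 2, 3):
    assert len(all_links(n, n)) == n ** 2
    assert count_unrestricted(n, n) == 2 ** (n ** 2)
    assert count_one_to_one(n) == factorial(n)
    print(n, n ** 2, 2 ** (n ** 2), factorial(n))
```

Even at n = 3 the unrestricted space already holds 512 alignments against 6 one-to-one alignments, which illustrates why constraints on the alignment space matter; enumeration of this kind is only feasible for toy sizes.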
Similar resources
Tuning Syntactically Enhanced Word Alignment for Statistical Machine Translation
We introduce a syntactically enhanced word alignment model that is more flexible than state-of-the-art generative word alignment models and can be tuned according to different end tasks. First of all, this model takes the advantages of both unsupervised and supervised word alignment approaches by obtaining anchor alignments from unsupervised generative models and seeding the anchor alignments i...
Automatic Phrase Alignment Using statistical n-gram alignment for syntactic phrase alignment
A parallel treebank consists of syntactically annotated sentences in two or more languages, taken from translated (i.e. parallel) documents. These parallel sentences are linked through alignment. Much work has been done on sentence and word alignment, but not as much on the intermediate level. This paper explores using n-gram alignment created for statistical machine translation based on GIZA++...
Exploiting Word Transformation in Statistical Machine Translation from Spanish to English
This paper investigates the use of morphosyntactic information to reduce data sparseness in statistical machine translation from Spanish to English. In particular, word-alignment training is performed by applying different word transformations using lemmas and stems. It has been observed that stem-based training is better than lemma-based training when up to 1 million running words of data are u...
Dependency Treelet Translation: Syntactically Informed Phrasal SMT
We describe a novel approach to statistical machine translation that combines syntactic information in the source language with recent advances in phrasal translation. This method requires a source-language dependency parser, target language word segmentation and an unsupervised word alignment component. We align a parallel corpus, project the source dependency parse onto the target sentence, e...
Measuring Word Alignment Quality for Statistical Machine Translation
Automatic word alignment plays a critical role in statistical machine translation. Unfortunately the relationship between alignment quality and statistical machine translation performance has not been well understood. In the recent literature the alignment task has frequently been decoupled from the translation task, and assumptions have been made about measuring alignment quality for machine t...
Publication date: 2008